Processor mechanisms for software shared memory
Author
Abstract
This thesis describes and evaluates the effectiveness of four hardware mechanisms for software shared memory: block status bits, a global translation lookaside buffer, a fast, non-blocking event system, and dedicated thread slots for software handlers. These mechanisms have been integrated into the M-Machine's MAP processor, and accelerate tasks that are common to many shared-memory protocols, including detection of remote memory references, invocation of software handlers, and determination of the home node of an address. The M-Machine's mechanisms for shared memory require little hardware to implement, including 3KB of RAM and the register files for the thread slots allocated to shared-memory handlers. Integrating these mechanisms into the processor instead of providing shared-memory support through an off-chip co-processor reduces the hardware cost of shared memory, eliminates inter-chip communication delays in interactions between the CPU and the shared-memory system, and improves resource utilization by allowing shared-memory handlers to use the same processor resources as user programs. Hardware support for shared memory significantly improves the M-Machine's remote memory access time, allowing remote memory requests to be resolved in as little as 336 cycles, as compared to 1500+ cycles on an M-Machine without hardware support. In program-level experiments, the M-Machine's shared-memory system was shown to allow efficient exploitation of parallelism, achieving a speedup of greater than 4x on an 8-node FFT. The MAP chip's non-blocking memory system was shown to be a significant contributor to the remote memory access time, as operations that reference remote data must be enqueued in a software data structure while they are being resolved. To improve performance, additional mechanisms, including transaction buffers, have been proposed and evaluated.
In concert, the additional mechanisms proposed in this thesis reduce the remote memory access time to 229 cycles, improving program execution time by up to 16%.
Thesis Supervisor: William J. Dally
Title: Professor of Electrical Engineering and Computer Science
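The remote-access latencies quoted in the abstract can be compared with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and "1500+" is treated as a lower bound of 1500 cycles, so the first ratio is a minimum rather than an exact figure.

```python
# Cycle counts quoted in the abstract. The names below are hypothetical labels.
BASELINE_CYCLES = 1500    # without hardware support ("1500+", so a lower bound)
HW_SUPPORT_CYCLES = 336   # best case with the four MAP mechanisms
PROPOSED_CYCLES = 229     # with the additional proposed mechanisms


def speedup(before: int, after: int) -> float:
    """Ratio of the old latency to the new latency."""
    return before / after


def reduction_percent(before: int, after: int) -> float:
    """Percentage of cycles removed when going from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    hw = speedup(BASELINE_CYCLES, HW_SUPPORT_CYCLES)
    extra = reduction_percent(HW_SUPPORT_CYCLES, PROPOSED_CYCLES)
    total = speedup(BASELINE_CYCLES, PROPOSED_CYCLES)
    print(f"hardware support vs. none: at least {hw:.2f}x faster")
    print(f"proposed mechanisms: {extra:.1f}% fewer cycles than 336")
    print(f"proposed mechanisms vs. none: at least {total:.2f}x faster")
```

These are ratios of best-case remote access latency, not of program run time; the abstract reports the program-level effect of the additional mechanisms separately, as an execution-time improvement of up to 16%.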
Similar resources
Transactional Memory
Shared memory parallel architectures present a single unified address space to each processor. Usually the memory is physically distributed across the system but each processor is able to access any part of it through a single address space. The system hardware is responsible for presenting this abstraction to each processor. Communication between processors is done implicitly through normal me...
Multigrain Shared Memory
Parallel workstations, each comprising a 10-100 processor shared memory machine, promise cost-effective general-purpose multiprocessing. This thesis explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. Multiprocessors built in this fashion are called Distributed Scalable Shared memo...
Uniprocessor Virtual Memory without TLBs
We present a feasibility study for performing virtual address translation without specialized translation hardware. Removing address translation hardware and instead managing address translation in software has the potential to make the processor design simpler, smaller, and more energy-efficient at little or no cost in performance. The purpose of this study is to describe the design and quant...
Architectural Support for an Efficient Implementation of a Software-Only Directory Cache Coherence Protocol
Software-only directory cache coherence protocols emulate directory management by handlers executed on the compute processor in shared-memory multiprocessors. While their potential lies in lower implementation cost and complexity than traditional hardware-only directory protocols, the miss penalty for cache misses induced by application data accesses as well as directory accesses is a critical ...
Software Mechanisms to Identify and Mitigate Intercore Memory Subsystem Shared Resource Contention for Multiprogram Workloads
Multicore processors have become ubiquitous in recent years and have become the norm across embedded, desktop and server markets. This shift in processor design has had profound implications on hardware resource sharing between processes on a system. Rather than allocating a fixed partition of each resource to each core, common resources such as last level caches, integrated memory controllers,...